Fix: API Thread held forever during force deleting across MS #12968

Merged
DaanHoogland merged 2 commits into apache:4.22 from
shapeblue:422-fix-thread-held-forever-force-delete
Apr 15, 2026

@nvazquez
Contributor

@nvazquez nvazquez commented Apr 6, 2026

Description

This PR fixes an indefinite hang of the deleteHost operation in environments with multiple management servers. In a multi-management-server (clustered) environment, a forced deleteHost API call causes the calling MS to hang indefinitely, eventually exhausting API threads and rendering the entire environment unresponsive (502 gateway errors).

Fixed by:

  • Propagating the isForced and isForceDeleteStorage flags across management servers via PropagateResourceEventCommand
  • Catching RuntimeException in ClusterDispatcher.dispatch() so that unexpected failures return a proper error response instead of silently hanging
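
The two changes above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `PropagateEvent`, `Answer`, `dispatch` and `deleteHost` are hypothetical stand-ins, and only the field names (hostId, event, forced, forceDeleteStorage) are taken from the PropagateResourceEventCommand JSON that appears in the test logs of this PR. The sketch assumes the hang comes from a peer handler throwing before any reply is produced.

```java
import java.util.function.Function;

// Illustrative sketch only; all types here are hypothetical stand-ins,
// not the real CloudStack classes.
final class ForceDeletePropagationSketch {

    /** Stand-in for the propagated command; it carries both force flags. */
    static final class PropagateEvent {
        final long hostId;
        final String event;
        final boolean forced;
        final boolean forceDeleteStorage;

        PropagateEvent(long hostId, String event, boolean forced, boolean forceDeleteStorage) {
            this.hostId = hostId;
            this.event = event;
            this.forced = forced;
            this.forceDeleteStorage = forceDeleteStorage;
        }
    }

    /** Stand-in for the reply sent back to the calling management server. */
    static final class Answer {
        final boolean result;
        final String details;

        Answer(boolean result, String details) {
            this.result = result;
            this.details = details;
        }
    }

    /**
     * Hypothetical dispatcher: a RuntimeException from the handler is turned
     * into a failed Answer, so the caller always receives a reply.
     */
    static Answer dispatch(PropagateEvent cmd, Function<PropagateEvent, Answer> handler) {
        try {
            return handler.apply(cmd);
        } catch (RuntimeException e) {
            return new Answer(false, "Dispatch failed: " + e.getMessage());
        }
    }

    /** Hypothetical handler on the owning MS; it reads the flags from the command. */
    static Answer deleteHost(PropagateEvent cmd) {
        if (!"DeleteHost".equals(cmd.event)) {
            throw new IllegalArgumentException("Unsupported event: " + cmd.event);
        }
        return new Answer(true, "host " + cmd.hostId + " deleted (forced=" + cmd.forced
                + ", forceDeleteStorage=" + cmd.forceDeleteStorage + ")");
    }

    public static void main(String[] args) {
        // Normal path: the flags travel with the command.
        Answer ok = dispatch(new PropagateEvent(2L, "DeleteHost", true, true),
                ForceDeletePropagationSketch::deleteHost);
        System.out.println(ok.result + " " + ok.details);

        // Failure path: the exception becomes an error reply instead of silence.
        Answer failed = dispatch(new PropagateEvent(2L, "Unknown", true, true),
                ForceDeletePropagationSketch::deleteHost);
        System.out.println(failed.result + " " + failed.details);
    }
}
```

In this sketch the caller gets an `Answer` on both paths, which is the property the second bullet describes: a failure on the peer side surfaces as an error response rather than leaving the API thread waiting.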

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@nvazquez
Contributor Author

nvazquez commented Apr 6, 2026

@blueorangutan package

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.60%. Comparing base (4708121) to head (64e346e).
⚠️ Report is 58 commits behind head on 4.22.

Files with missing lines                                 Patch %   Lines
...cloud/agent/api/PropagateResourceEventCommand.java    0.00%     12 Missing ⚠️
...cloud/agent/manager/ClusteredAgentManagerImpl.java    0.00%     9 Missing ⚠️
...n/java/com/cloud/resource/ResourceManagerImpl.java    35.71%    9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.22   #12968      +/-   ##
============================================
- Coverage     17.60%   17.60%   -0.01%     
- Complexity    15677    15678       +1     
============================================
  Files          5918     5918              
  Lines        531681   531711      +30     
  Branches      65005    65008       +3     
============================================
- Hits          93623    93622       -1     
- Misses       427498   427526      +28     
- Partials      10560    10563       +3     
Flag        Coverage Δ
uitests     3.70% <ø> (ø)
unittests   18.67% <14.28%> (-0.01%) ⬇️

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 17371

@nvazquez
Contributor Author

nvazquez commented Apr 6, 2026

@blueorangutan package

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✖️ debian ✔️ suse15. SL-JID 17372

@nvazquez
Contributor Author

nvazquez commented Apr 6, 2026

@blueorangutan test

@blueorangutan

@nvazquez a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Contributor

@sureshanaparti sureshanaparti left a comment

clgtm

@blueorangutan

[SF] Trillian test result (tid-15815)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 50785 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12968-t15815-kvm-ol8.zip
Smoke tests completed. 149 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm. How can this be tested, @nvazquez?

@nvazquez
Contributor Author

@DaanHoogland this was impacting large multi-management-server environments during forced host deletion. While it may be hard to replicate, no regressions should be observed in multi-management-server environments when force-removing hosts.

Member

@kiranchavala kiranchavala left a comment

LGTM

Tested manually:

  • Deployed a CloudStack environment with multiple management servers
  • Added a KVM host so that it is owned by (connected to) MS-1
  • Issued the deleteHost API call with the forced option from MS-2
  • The KVM host was deleted successfully


2026-04-15 05:04:28,540 DEBUG [c.c.a.ApiServlet] (qtp1390913202-7010:[ctx-b2bd9068]) (logid:acab41d5) ===START===  10.0.3.251 -- POST
command=deleteHost
response=json
id=1d668143-f03d-4d1f-8155-35d31c656c87
forced=true
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM

2026-04-15 05:04:28,540 DEBUG [c.c.a.ApiServlet] (qtp1390913202-7010:[ctx-b2bd9068]) (logid:acab41d5) Two factor authentication is already verified for the user 2, so skipping
2026-04-15 05:04:28,550 DEBUG [c.c.a.ApiServer] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"630244b9-3739-11f1-9978-1e00e20001cf"}]' is allowed to perform API calls: 0.0.0.0/0,::/0
2026-04-15 05:04:28,553 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) Account for user id 6302a12a-3739-11f1-9978-1e00e20001cf is Root Admin or Domain Admin, all APIs are allowed.
2026-04-15 05:04:28,553 DEBUG [o.a.c.a.StaticRoleBasedAPIAccessChecker] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) RoleService is enabled. We will use it instead of StaticRoleBasedAPIAccessChecker.
2026-04-15 05:04:28,553 DEBUG [o.a.c.r.ApiRateLimitServiceImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) API rate limiting is disabled. We will not use ApiRateLimitService.

2026-04-15 05:04:28,558 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) Propagating resource request event:DeleteHost to agent:2

2026-04-15 05:04:28,558 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) 32985382387974 -> 32989140484559.2 [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":2,"event":"DeleteHost","forced":true,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]

2026-04-15 05:04:28,562 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) Cluster PDU 32985382387974 -> 32989140484559. agent: 2, pdu seq: 12473, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":2,"event":"DeleteHost","forced":true,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,562 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) Executing ClusterServicePdu with service URL: https://10.0.33.60:9090/clusterservice
2026-04-15 05:04:28,564 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) POST https://10.0.33.60:9090/clusterservice response :true, responding time: 2 ms
2026-04-15 05:04:28,564 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) Cluster PDU 32985382387974 -> 32989140484559 completed. time: 2ms. agent: 2, pdu seq: 12473, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":2,"event":"DeleteHost","forced":true,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,624 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Dispatch ->2, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":2,"event":"AgentDisconnected","contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,625 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Intercepting command for agent change: agent 2 event: AgentDisconnected
2026-04-15 05:04:28,625 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Received agent disconnect event for host 2 (null)
2026-04-15 05:04:28,625 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Result is true
2026-04-15 05:04:28,627 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) 32985382387974 -> 32989140484559.2 completed. result: [{"com.cloud.agent.api.Answer":{"result":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,627 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) Result for agent change is true
2026-04-15 05:04:28,628 DEBUG [c.c.a.ApiServlet] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) ===END===  10.0.3.251 -- POST
command=deleteHost
response=json
id=1d668143-f03d-4d1f-8155-35d31c656c87
forced=true
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM

Tested the deleteHost API without the forced parameter:

Host deleted successfully


2026-04-15 05:15:47,621 DEBUG [c.c.a.ApiServlet] (qtp1390913202-23:[ctx-1bf5b649]) (logid:832f240c) ===START===  10.0.3.251 -- POST
command=deleteHost
response=json
id=7d807b7c-092c-4994-86e1-3b751bbca11e
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM

2026-04-15 05:15:47,621 DEBUG [c.c.a.ApiServlet] (qtp1390913202-23:[ctx-1bf5b649]) (logid:832f240c) Two factor authentication is already verified for the user 2, so skipping
2026-04-15 05:15:47,628 DEBUG [c.c.a.ApiServer] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"630244b9-3739-11f1-9978-1e00e20001cf"}]' is allowed to perform API calls: 0.0.0.0/0,::/0
2026-04-15 05:15:47,630 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) Account for user id 6302a12a-3739-11f1-9978-1e00e20001cf is Root Admin or Domain Admin, all APIs are allowed.
2026-04-15 05:15:47,630 DEBUG [o.a.c.a.StaticRoleBasedAPIAccessChecker] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) RoleService is enabled. We will use it instead of StaticRoleBasedAPIAccessChecker.
2026-04-15 05:15:47,630 DEBUG [o.a.c.r.ApiRateLimitServiceImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) API rate limiting is disabled. We will not use ApiRateLimitService.
2026-04-15 05:15:47,633 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) Propagating resource request event:DeleteHost to agent:5

2026-04-15 05:15:47,633 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) 32985382387974 -> 32989140484559.5 [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":5,"event":"DeleteHost","forced":false,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]

2026-04-15 05:15:47,633 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) Cluster PDU 32985382387974 -> 32989140484559. agent: 5, pdu seq: 12535, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":5,"event":"DeleteHost","forced":false,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]

2026-04-15 05:15:47,633 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) Executing ClusterServicePdu with service URL: https://10.0.33.60:9090/clusterservice
2026-04-15 05:15:47,636 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) POST https://10.0.33.60:9090/clusterservice response :true, responding time: 2 ms
2026-04-15 05:15:47,636 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) Cluster PDU 32985382387974 -> 32989140484559 completed. time: 2ms. agent: 5, pdu seq: 12535, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":5,"event":"DeleteHost","forced":false,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:15:47,685 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) 32985382387974 -> 32989140484559.5 completed. result: [{"com.cloud.agent.api.Answer":{"result":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:15:47,685 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) Result for agent change is true
2026-04-15 05:15:47,685 DEBUG [c.c.a.ApiServlet] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) ===END===  10.0.3.251 -- POST
command=deleteHost
response=json
id=7d807b7c-092c-4994-86e1-3b751bbca11e
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM

@DaanHoogland DaanHoogland merged commit 160876c into apache:4.22 Apr 15, 2026
25 of 26 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Apache CloudStack 4.22.1 Apr 15, 2026
@DaanHoogland DaanHoogland deleted the 422-fix-thread-held-forever-force-delete branch April 15, 2026 06:41
Projects

Status: Done

5 participants